Intake-ESM Integration based on #1218 #2690

Open
charles-turner-1 wants to merge 37 commits into main from intake-esm

@charles-turner-1 charles-turner-1 commented Mar 13, 2025

Description

  • Add intake-dataset class to load datasets via intake.
  • Update config-developer.yml to include intake datasets.

TODO:

  • Our intake catalogs here on Gadi have a bunch of extra keys (facets) that I haven't mapped. Is there any documentation on where to find all potential facets that ESMValCore might accept & what they represent? I've been struggling to find them.
  • Tests - presumably the obvious place to stick these is in tests/unit/test_dataset.py, or is it preferable to add a new test module? I'll hold off writing these until I work out the facets issue.
  • Structure: I've put this in an intake submodule, but I could move it into dataset if that's preferable? Also affects previous point.

Have requested a review but obviously this is nowhere near ready to go on the infrastructure side wrt. tests, etc. A couple pointers in the right direction and that stuff should fly along.

Closes #31

Link to documentation:


Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

codecov bot commented Mar 13, 2025

Codecov Report

❌ Patch coverage is 94.56522% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.61%. Comparing base (2792ad1) to head (27aa007).

Files with missing lines | Patch % | Lines
esmvalcore/io/intake_esm.py | 94.56% | 5 Missing ⚠️

❌ Your patch check has failed because the patch coverage (94.56%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2690      +/-   ##
==========================================
- Coverage   95.62%   95.61%   -0.01%     
==========================================
  Files         266      267       +1     
  Lines       15601    15693      +92     
==========================================
+ Hits        14918    15005      +87     
- Misses        683      688       +5     


Member

@bouweandela bouweandela left a comment

Great to see progress on this @charles-turner-1!

SYNDA: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
NCI: '{activity}/{institute}/{dataset}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}'
input_file: '{short_name}_{mip}_{dataset}_{exp}_{ensemble}_{grid}*.nc'
catalogs:
Member


The plan was to not further extend config-developer, but rather move this to the new configuration that lives in ~/.config/esmvaltool. See #2371 for an example of what we thought the configuration should look like.

- /g/data/oi10/catalog/v2/esm/catalog.json
facets:
# mapping from recipe facets to intake-esm catalog facets
# TODO: Fix these when Gadi is back up
Member


You could also test on DKRZ Levante, where the intake catalogs are located at /pool/data/Catalogs/dkrz_cmip6_disk.json

return ([_CACHE[cat_url] for cat_url in catalog_urls], facet_list)


class IntakeDataset(Dataset):
Member

@bouweandela bouweandela Mar 21, 2025


I'm having some reservations about subclassing the Dataset class for this purpose:

  • A typical use case for many of our users will be that they have most data available from a central catalog that is managed by a central administrator, but want to augment that with the ability to download some files themselves. In that case, it is really useful to have the ability to deduplicate (e.g. pick the latest version of a file). I'm not sure if this can be achieved by subclassing the Dataset object.
  • We will likely want to add support for other catalogs as well, e.g. intake-esgf, xcube, and STAC. If we need a new Dataset class for each of these, it may become confusing to users.
  • How will this work from the recipe?

As an alternative, would it be an option to load the available data sources from the configuration / Dataset.session and then make the Dataset.files method loop over the available sources and deduplicate input files?
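The alternative described above (loop over the configured sources, then deduplicate) could look roughly like the sketch below. It is not ESMValCore code: FoundFile, deduplicate, find_files and the two example sources are hypothetical names made up for illustration, and it assumes that version strings such as "v20200101" sort correctly as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FoundFile:
    """Hypothetical record for a file returned by a data source."""

    name: str  # file name, used here as the identity of the file
    version: str  # e.g. "v20200101"; compared as a plain string
    source: str  # label of the data source that returned the file


def deduplicate(files):
    """Keep only the latest version of each file name."""
    latest = {}
    for found in files:
        current = latest.get(found.name)
        if current is None or found.version > current.version:
            latest[found.name] = found
    # Sort by name so the result does not depend on the source order.
    return sorted(latest.values(), key=lambda found: found.name)


def find_files(sources, **facets):
    """Query every source with the same facets and merge the results."""
    found = []
    for source in sources:
        found.extend(source(**facets))
    return deduplicate(found)


# Two hypothetical sources: a central catalog and a user download area.
def catalog_source(**facets):
    return [FoundFile("tas_Amon_X.nc", "v20190101", "catalog")]


def download_source(**facets):
    return [
        FoundFile("tas_Amon_X.nc", "v20200101", "download"),
        FoundFile("pr_Amon_X.nc", "v20190101", "download"),
    ]


files = find_files([catalog_source, download_source], short_name="tas")
```

With this shape, supporting another catalog type would mean adding another source to the list rather than another Dataset subclass.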

@bouweandela
Member

bouweandela commented Mar 21, 2025

Is there any documentation on where to find all potential facets that ESMValCore might accept & what they represent?

ESMValCore is quite flexible with what facets it accepts. We have a translation between some of 'our' facets and the official ones in the esmvalcore.esgf.facets module (this is the subset that we use to search for files on ESGF). A few facets are used by ESMValCore for specific purposes such as CMOR checks and fixes (off the top of my head that would be dataset, project, mip, short_name), but others are entirely free-form and only used for finding input files and defining the output file names using the paths described in the config-developer.yml file.

Our intake catalogs here on Gadi have a bunch of extra keys (facets) that I haven't mapped.

If these are completely determined by the other facets, you can add them automatically using the extra facets facility.
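For illustration, a translation from recipe facets to catalog facets could be as simple as the sketch below. FACET_MAP and to_catalog_query are hypothetical names, and the catalog column names on the right-hand side are assumed CMIP6-style names, not taken from the Gadi catalog or from esmvalcore.esgf.facets.

```python
# Hypothetical mapping from recipe facet names to catalog column names.
# The column names are assumptions (CMIP6-style), not the Gadi catalog's.
FACET_MAP = {
    "dataset": "source_id",
    "exp": "experiment_id",
    "ensemble": "member_id",
    "mip": "table_id",
    "short_name": "variable_id",
}


def to_catalog_query(recipe_facets, facet_map=FACET_MAP):
    """Translate recipe facets into a catalog query.

    Facets without an entry in the mapping are dropped, so free-form
    facets never reach the catalog search.
    """
    return {
        facet_map[name]: value
        for name, value in recipe_facets.items()
        if name in facet_map
    }


query = to_catalog_query(
    {
        "project": "CMIP6",
        "dataset": "ACCESS-ESM1-5",
        "mip": "Amon",
        "short_name": "tas",
    }
)
```

Dropping unmapped facets is one possible choice; raising an error for unknown facets would make missing mappings easier to spot.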

@bouweandela
Member

Structure: I've put this in an intake submodule,

How about adding a new module called e.g. esmvalcore.data or esmvalcore.data_sources or something similar and adding it as a submodule there? We could also move the esmvalcore.local and esmvalcore.esgf modules there (does not have to be in this pull request). I foresee us adding multiple input data sources in the near future.

@charles-turner-1
Author

Thanks for the review Bouwe, super helpful! I've only had a skim so far, but I'll get those suggestions incorporated next week

@bouweandela
Member

I started working on adding some interface code that could be useful here too in #2765.

@charles-turner-1
Author

Cheers, I'll take a look when I get the chance! Gonna talk to Martin Durant (author of Intake) in ~10 days, so hopefully this PR should pick up some steam after then. I'll be working on this stuff more actively.

@bouweandela
Member

This should be a lot easier now that #2765 has landed. You could take the esmvalcore.io.intake_esgf module as an example and add a configuration file similar to data-intake-esgf.yml.

@valeriupredoi
Contributor

@charles-turner-1 I popped the latest main here, that includes #2765 - do you reckon you'll have time to restart the work on it soon, mate? If not, no biggie, just pls let @bouweandela and myself know - we can take it from here, there is a bit of a tight schedule on getting full Zarr support (not only as a simple load via esmvalcore IO), and I reckon this is superuseful towards that 🍻

@charles-turner-1
Author

Been hoping to get back to this for a while... I just keep managing to find more urgent stuff to get in the way. Me & @rbeucher will be in Canberra together next week, so hopefully we can get a handle on our priorities then.

@charles-turner-1
Author

Just a heads up that I'm getting back to this now - will reach out if I have any issues!

@charles-turner-1
Author

charles-turner-1 commented Feb 5, 2026

  • Seems like the ordering of the results from IntakeEsmDataSource.find_data is not stable: tests that fail in CI (with different outcomes between runs) pass locally. (I'm working on this right now.)
  • Test sample data is not being found either, which is not something I expected to go awry. We can happily just move the sample data out to an object storage service or use a dataset already there if we like. I'll fix it up with local data first and think about that later.
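One common way to get a stable order is to sort the results on an explicit key before returning them. The sketch below is not the fix applied in this PR: stable_order is a hypothetical helper, and it assumes each result can be described by a dict of facets.

```python
# Facets used to build the sort key, in priority order (an assumption).
SORT_KEYS = ("dataset", "exp", "ensemble", "mip", "short_name", "version")


def stable_order(results, keys=SORT_KEYS):
    """Return results sorted on a fixed tuple of facet values.

    Missing facets sort as empty strings, and values are converted to
    str so that mixed types cannot raise a TypeError during comparison.
    """
    return sorted(
        results,
        key=lambda facets: tuple(str(facets.get(key, "")) for key in keys),
    )


first_run = [
    {"dataset": "B", "short_name": "tas"},
    {"dataset": "A", "short_name": "tas"},
    {"dataset": "A", "short_name": "pr"},
]
second_run = list(reversed(first_run))  # same results, different order
```

Both runs then give the same order, so tests that compare the returned list no longer depend on how the catalog search happened to order its rows.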

@charles-turner-1 charles-turner-1 marked this pull request as ready for review February 5, 2026 06:49
@charles-turner-1
Author

Few lines of coverage to fix, but I think this is mostly ready for review now!

Development

Successfully merging this pull request may close these issues.

Consider using the intake-esm library

3 participants